[bugfix][torchair] fix kv_nz accuracy problem and remove redundant reshape_and_cache operation#3066
[bugfix][torchair] fix kv_nz accuracy problem and remove redundant reshape_and_cache operation#3066linfeng-yuan wants to merge 1 commit into
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Code Review
This pull request correctly removes a redundant _npu_reshape_and_cache operation that was being called in torchair graph mode. This is a good simplification, as caching is already handled by npu_kv_rmsnorm_rope_cache in that scenario. However, an important assertion checking the kv_cache size was also removed. I've recommended re-adding it to ensure code robustness and prevent potential runtime errors.
1aace42 to
3893cd1
Compare
…cache operation Signed-off-by: linfeng-yuan <1102311262@qq.com>
3893cd1 to
d745a8e
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3066 +/- ##
==========================================
- Coverage 74.76% 71.96% -2.81%
==========================================
Files 150 168 +18
Lines 20891 23544 +2653
==========================================
+ Hits 15620 16943 +1323
- Misses 5271 6601 +1330
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@linfeng-yuan Thanks a lot!!! |
|
@linfeng-yuan This pull request only fixes one accuracy problem. Tests show accuracy is fine without KV NZ enabled, but still problematic when it's on, even after applying this change. The GSM-8K benchmark scores are still too low with KV NZ active.... |
|
@linfeng-yuan The |
jianzs
left a comment
There was a problem hiding this comment.
This PR cannot be merged until kv_nz is supported during the prefill phase.
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
|
Closed since torchair is dropped. |
What this PR does / why we need it?
This PR removes redundant calling of
reshape_and_cacheoperation at prefilling stage with torchair graph mode. This reduces prefilling latency as well as fixes accuracy problem whileenable_kv_nzis enabled. Although #2988 fixesenable_kv_nzaccuracy problem, the output tokens with deepseek is inaccurate, leading to a decline in benchmark scoring.Does this PR introduce any user-facing change?
No.
How was this patch tested?
We run e2e online serving and accuracy test containing eager mode with
enable_shared_expert_dpand torchair graph mode withenable_kv_nz.